Abstract

We consider a wireless communication system in which N transmitter-receiver pairs want to communicate with 
each other. Each transmitter transmits data at a certain rate using a power that depends on the channel gain to its 
receiver. If a receiver can successfully receive the message, it sends an acknowledgement (ACK), else it sends a 
negative ACK (NACK). Each user aims to maximize its probability of successful transmission. We formulate this 
problem as a stochastic game and propose a fully distributed learning algorithm to find a correlated equilibrium (CE). 
In addition, we use a no regret algorithm to find a coarse correlated equilibrium (CCE) for our power allocation game. 
We also propose a fully distributed learning algorithm to find a Pareto optimal solution. In general Pareto points 
do not guarantee fairness among the users, therefore we also propose an algorithm to compute a Nash bargaining 
solution which is Pareto optimal and provides fairness among users. Finally, under the same game theoretic setup,
we study these equilibria and Pareto points when each transmitter sends data at multiple rates rather than at a fixed 
rate. We compare the sum rate obtained at the CE, CCE, Nash bargaining solution and the Pareto point and also via 
some other well known recent algorithms. 
I. Introduction

In wireless communications, because the medium is broadcast, transmissions by one user cause interference
to other users. This reduces the transmission rate and/or increases the transmission error of the other users.
Therefore each user (transmitter) aims to use resources such as power and spectrum efficiently to improve its
performance, which may conflict with the goals of the other users. Thus, in this paper, we present a game
theoretic approach to obtain optimal power allocations that achieve an equilibrium among the different
transmitters.

We consider a single wireless channel shared by multiple transmitter-receiver pairs. It
is modeled as an interference channel. Its power allocation in a non-game-theoretic setup has been studied
in [1]-[3] and via game theory in BU-llH. Power allocation for parallel Gaussian interference channels is
studied in llH. Convergence of iterated best response for computing a Nash equilibrium under some conditions
on the channel gains is studied for single antenna systems in [4], [5]. In [6], convergence of iterative
water-filling for parallel Gaussian interference channels with multiple antennas is studied. Different algorithms are
presented in [7] and [8] to compute a Nash equilibrium for parallel Gaussian interference channels. In S,
we presented an algorithm to compute a Nash equilibrium for a stochastic game on Gaussian interference
channels under channel conditions weaker than those in [4].

In general, in a wireless communication system, a user may not know the other users’
channel states or their power policies. In such a setup, one needs distributed algorithms that each user
can run to reach optimal policies while requiring little information about the other users. Online learning
algorithms are such a class of algorithms [10]. Some of these algorithms, for example, fictitious play ifT^,
are partially distributed algorithms that require some knowledge about other users’ strategies to find an
equilibrium of the system. On the other hand, there exist fully distributed learning algorithms [10] which
do not need any information about the other users’ strategies or payoffs to find an equilibrium.

We describe the prior work in the literature of wireless communications that has considered learning for
optimal power allocation. The problem of minimizing energy consumption in point-to-point communication
with delay constraints in stochastic and unknown traffic and channel conditions is considered in [13]. This
problem is modeled as a Markov decision process and solved using online reinforcement learning.

In [14], orthogonal multiple access channels are considered. The problem of power allocation is modeled
as a non-cooperative potential game and distributed learning algorithms are proposed. A learning algorithm
for finding a Nash equilibrium for a multiple-input and multiple-output multiple access channel is proposed
in [15].

The problem of minimizing the total transmit power of a parallel Gaussian interference channel subject to
a minimum signal to interference plus noise ratio (SINR) is considered in IfT^ . Fully distributed algorithms
based on trial and error are proposed to find a Nash equilibrium and a satisfaction equilibrium. Learning
in wireless networks is also considered in the game theoretic framework in [17]-[19]. We refer to [10] for
more information on game theory and learning algorithms for wireless communications.

In [11], a learning algorithm using regrets is proposed to find a CE in a finite game. Unlike fully
distributed algorithms, this no-regret algorithm requires knowledge of the actions chosen by the other users
in each play of the game. The same authors presented a fully distributed learning algorithm in [20] that
leads to a CE when the players are not aware of the functional form of their utility functions.

Fully distributed algorithms to find a Nash equilibrium are developed in [22]-[25]. The algorithms in If^ .
||2^ are based on trial and error. Using these algorithms, users play a pure strategy
Nash equilibrium for a high proportion of time. For potential and dominance solvable games, reinforcement
learning algorithms in [|2^ . [|^ converge to a NE.

We consider a power allocation game on a wireless interference channel. It is neither a potential game
nor dominance solvable. Even the existence of a pure strategy NE is not guaranteed. Therefore, we cannot
use the algorithms in [|2^, [25] to obtain an equilibrium point. Thus we propose a variation of the regret
matching algorithm to find a CE of the proposed game on the wireless communication system without
knowing the strategies chosen by other users. The algorithms proposed in this paper are fully distributed.

Apart from learning a non-cooperative equilibrium, learning algorithms for finding a Pareto point also
exist in the literature ([|2^, [27]). Such points can substantially outperform a CE.

A. Contribution 

We make the following contributions in this paper: 

• We propose a fully distributed regret matching algorithm [11] that finds a correlated equilibrium (CE)
of the interference channel. The usual regret matching algorithm is a partially distributed algorithm
which requires knowledge of the strategies of the other users. We propose a modification of that
algorithm to convert it into a fully distributed algorithm. We also compare the sum rate at the CE
obtained by our algorithm with that obtained from the algorithm in COl and we note that our algorithm
converges faster than the algorithm in [|^ .

• We use a fully distributed no-regret dynamics to compute a CCE of our power allocation game. In 
general, every CE is a CCE but the converse may not be true. 

• We propose a fully distributed learning algorithm to find a Pareto point for our game and compare 
its sum rates with that of a CE. 

• Even though Pareto points outperform CE, and CCE, fairness among users is not guaranteed at a Pareto 
point. Using a minor variation of the proposed algorithm to compute a Pareto point, we compute a 
Nash bargaining solution which guarantees fairness among users. 

• Later, we show that the proposed algorithms can be used to compute a CE, a CCE, a Pareto point and Nash
bargaining solutions when each transmitter sends at multiple rates rather than at a fixed rate.

This paper is organized as follows. In Section II we describe the system model and define the problem
in a game theoretic framework. We propose and analyse a learning algorithm to find a CE in Section III. For
our game, we find Pareto points through a fully distributed learning algorithm in Section V. We study Nash
bargaining solutions for our game in Section VI. We use the multiplicative weight algorithm to find a CCE of our
game in Section IV. In Section VII we extend the algorithms to the power allocation problem where each
transmitter can transmit at multiple rates. In Section VIII we compare, for some numerical examples, the sum of
utilities of all the users at a CE and at a Pareto point, and also with other algorithms. Section
IX concludes the paper.

II. System Model 

We consider a wireless channel being shared by N independent transmitter-receiver pairs. Transmission 
from each transmitter causes interference at other receivers. The transmitted signal from every transmitter 
undergoes fading. The fading gain experienced by the intended signal at a receiver from its corresponding 
transmitter is called the direct link channel gain. Similarly, the fading gain experienced by other unintended 
signals at a receiver is called the cross link channel gain. We model this scenario as a Gaussian interference 
channel with fading, where each receiver perceives the transmitted signal with additive white Gaussian 
noise. 

Let $\mathcal{H}_d^{(i)} = \{h_1^{(i)}, \dots, h_{n_i}^{(i)}\}$ be the direct link channel gain alphabet and
$\mathcal{H}_c^{(i)} = \{g_1^{(i)}, \dots, g_{s_i}^{(i)}\}$ be the cross link channel gain alphabet of user
$i \in \{1, 2, \dots, N\}$. Let the random variable $H_{ij}(t)$ denote the channel gain from transmitter j to
receiver i in time slot t, which is assumed to be constant during the slot. Observe that
$H_{ii}(t) \in \mathcal{H}_d^{(i)}$ and $H_{ij}(t) \in \mathcal{H}_c^{(i)}$ for $i \neq j$. We denote a
realization of $H_{ij}(t)$ by $h_{ij}(t)$. We assume that for fixed $i, j \in \{1, 2, \dots, N\}$, the random
variables $H_{ij}(t)$, $t = 1, 2, \dots$, are independent and identically distributed. We also assume that the
$H_{ij}(t)$ are statistically independent for any $i, j \in \{1, 2, \dots, N\}$ and $t = 1, 2, \dots$.

We assume that transmitter i knows $H_{ii}(t)$ at the beginning of slot t but not $H_{ij}(t)$, $i \neq j$; in
fact, it does not know the distribution of $H_{ij}(t)$ either. We also assume that transmitter i has finitely many
power levels $p_1^{(i)}, \dots, p_{m_i}^{(i)}$ to transmit in a slot. This is a typical wireless scenario. For
example, if a receiver is sending an ACK/NACK to its transmitter and it is a time duplex channel, then a
transmitter can estimate its direct link channel gain but will not know the cross link channel gains; nor will it
know the transmit powers used by the other transmitters.


User i transmits $r_i$ bits in every channel use at a power level which depends on the direct link channel gain.
If receiver i successfully receives the message sent in that slot, it sends an ACK (positive acknowledgement),
else the receiver sends back a NACK (negative acknowledgement) at the end of the slot. We assume that
ACK messages are small and sent at a low rate such that they are received with negligible probability of
error, and their transmission overheads are ignored. This is typically assumed [21]. For a Gaussian channel,
the probability of error is a function of the received SINR and the modulation and coding used. For a given
coding and modulation, we can fix a minimum SINR such that the probability of error is negligible above
this SINR. To be specific, we assume that if in time slot t, user i transmitted $r_i$ bits per channel use at
a power level $p_l^{(i)}$, $l \in \{1, 2, \dots, m_i\}$, and $H_{ij}(t) = h_{ij}(t)$, then transmitter i receives
an ACK from its corresponding receiver if and only if

$$ r_i \le \log_2\left(1 + \frac{|h_{ii}(t)|^2\, p_l^{(i)}}{\Gamma_i\,\big(1 + I_t^{(i)}\big)}\right),
\quad \text{i.e.,} \quad I_t^{(i)} \le \frac{|h_{ii}(t)|^2\, p_l^{(i)}}{\Gamma_i\,(2^{r_i} - 1)} - 1, $$

and the transmitter receives a NACK otherwise; here $I_t^{(i)}$ denotes the interference experienced by receiver i
in slot t (with the noise power normalized to one) and $\Gamma_i$ is a constant that depends on the modulation
and coding used by the ith receiver. In the following we will take $\Gamma_i = 1$ for all i for convenience.
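The ACK rule can be illustrated with a minimal Python sketch. It assumes unit noise power, a base-2 logarithm and a scalar interference value; the function and parameter names (`interference_threshold`, `receives_ack`, `gamma_gap`) are hypothetical and not taken from the paper.

```python
import math

def interference_threshold(h_ii, p, rate, gamma_gap=1.0):
    """Largest interference (unit noise power assumed) at which `rate`
    bits per channel use can still be decoded."""
    return abs(h_ii) ** 2 * p / (gamma_gap * (2.0 ** rate - 1.0)) - 1.0

def receives_ack(h_ii, p, rate, interference, gamma_gap=1.0):
    """True if the rate is supported by the received SINR (ACK), else False (NACK)."""
    sinr = abs(h_ii) ** 2 * p / (gamma_gap * (1.0 + interference))
    return rate <= math.log2(1.0 + sinr)
```

Under these assumptions, a transmission is acknowledged exactly when the interference does not exceed the threshold returned by `interference_threshold`.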

We consider stationary policies, i.e., the power used by user i in slot t depends only on the channel gain 
but is independent of time t. Thus, we define the feasible action space of user i by 

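$$ \mathcal{A}^{(i)} = \Big\{ k^{(i)} : \mathcal{H}_d^{(i)} \to \{p_1^{(i)}, \dots, p_{m_i}^{(i)}\} \;:\;
\sum_{l=1}^{n_i} \pi^{(i)}\big(h_l^{(i)}\big)\, k^{(i)}\big(h_l^{(i)}\big) \le \overline{P}_i \Big\}, \tag{1} $$

where $\pi^{(i)}$ is the probability distribution of $H_{ii}(t)$. We note that user i has an average power constraint
$\overline{P}_i$ for each feasible action. As the set of power levels of each user is finite, the number of elements
in $\mathcal{A}^{(i)}$ is also finite. Let the cardinality of $\mathcal{A}^{(i)}$ be $|\mathcal{A}^{(i)}| = L_i$.
We enumerate the elements of $\mathcal{A}^{(i)}$ as $\{1, 2, \dots, L_i\}$, i.e., when we write
$k^{(i)} \in \mathcal{A}^{(i)}$ we mean $k^{(i)}$ is a feasible power policy of user i for $k^{(i)} = 1, \dots, L_i$,
and $k^{(i)}(h)$, $h \in \mathcal{H}_d^{(i)}$, denotes the power used when the direct link channel gain is h under
the policy $k^{(i)}$. We denote the action space of the set of all users by
$\mathcal{A} = \mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(N)}$ and an action profile of all users by
$k = (k^{(1)}, \dots, k^{(N)})$. We denote the action space of all users other than user i by
$\mathcal{A}^{(-i)} = \mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(i-1)} \times \mathcal{A}^{(i+1)} \times \cdots \times \mathcal{A}^{(N)}$
and the action profile of all users other than i by $k^{(-i)} \in \mathcal{A}^{(-i)}$. Let $k_t^{(i)}$ indicate the
action of user i at time t. Also, $k_t = (k_t^{(i)}, i = 1, \dots, N)$. A strategy $\phi_i$ of user i is a
probability distribution on $\mathcal{A}^{(i)}$, and a pure strategy is a degenerate probability distribution
where a certain action is chosen with probability one.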
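As an illustration of how the finite action space can be enumerated, the following sketch lists every map from channel gains to power levels that meets an average power budget. The names `feasible_policies`, `gain_probs`, `power_levels` and `avg_power` are hypothetical, and the brute-force enumeration is only practical for small alphabets.

```python
from itertools import product

def feasible_policies(gain_probs, power_levels, avg_power):
    """Return all policies (one power level per direct-link gain) whose
    mean power under `gain_probs` does not exceed `avg_power`."""
    policies = []
    for policy in product(power_levels, repeat=len(gain_probs)):
        mean_power = sum(q * p for q, p in zip(gain_probs, policy))
        if mean_power <= avg_power + 1e-12:  # small tolerance for rounding
            policies.append(policy)
    return policies
```

Each returned tuple plays the role of one action: its entry at position l is the power used when the l-th direct link gain occurs.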

In a given time slot t, each user chooses an action that maximizes its probability of successful transmission.
The strategy of each user influences the probability of successful transmission of every user and hence we
are interested in finding an equilibrium point. To model this as a game, we define the reward of user i for
a given action profile $k_t$ in time slot t with direct link channel gain $h_{ii}(t)$ as

$$ w_t^{(i)}\big(k_t; h_{ii}(t)\big) = \begin{cases} 1, & \text{if transmitter } i \text{ receives ACK}, \\ 0, & \text{else}. \end{cases} \tag{2} $$

This reward of user i in a given time slot t is random as it depends on the cross link channel gains $h_{ij}(t)$
and the power levels at which the other users transmit in that slot, which in turn depend on the direct
link channel gains of those users. The average reward of user i for the action profile k is

$$ u_i(k) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} w_t^{(i)}\big(k; H_{ii}(t)\big). \tag{3} $$

By the strong law of large numbers this limit exists a.s. and the average reward $u_i(k)$ can be interpreted
as the probability of successful transmission. The average reward of user i given the mixed strategy
$(\phi_i, \phi_{-i})$ is

$$ u_i(\phi_i, \phi_{-i}) = \sum_{k \in \mathcal{A}} \Big( \prod_{j=1}^{N} \phi_j\big(k^{(j)}\big) \Big)\, u_i(k). \tag{4} $$

Each player aims to maximize its own average reward, i.e., its probability of successful transmission. We
model this scenario as a stochastic game and restrict ourselves to stationary policies. Then the utility of
user i can be written as

$$ u_i(k) = \mathbb{E}\big[ w^{(i)}(k; H_{ii}) \big], \tag{5} $$

where the expectation is with respect to the distribution of the random variables $H_{ij}$ for all i, j. Thus the
stochastic game is equivalent to the one-shot game in which user i maximizes its utility defined in (5), and
the set of correlated equilibria (CE), defined below, is the same for both games.






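Definition 1. Given a strategic game $\mathcal{G}$, a joint probability distribution $\phi(k)$ is said to be a
correlated equilibrium if for all $i = 1, \dots, N$, $k^{(i)} \in \mathcal{A}^{(i)}$ and
$\tilde{k}^{(i)} \in \mathcal{A}^{(i)}$ we have

$$ \sum_{k^{(-i)} \in \mathcal{A}^{(-i)}} \phi\big(k^{(i)}, k^{(-i)}\big)
\Big[ u_i\big(\tilde{k}^{(i)}, k^{(-i)}\big) - u_i\big(k^{(i)}, k^{(-i)}\big) \Big] \le 0. \tag{6} $$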
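The correlated equilibrium condition can be checked numerically for a small finite game. The sketch below is only an illustration: the data layout (dictionaries keyed by action-profile tuples) and the name `is_correlated_eq` are assumptions, not part of the paper.

```python
def is_correlated_eq(phi, utils, action_sets, eps=0.0):
    """phi: dict mapping action profiles (tuples) to probabilities.
    utils[i]: dict mapping every action profile to the utility of player i.
    Returns True if no player gains more than eps by swapping any
    recommended action a for another action b."""
    for i, actions in enumerate(action_sets):
        for a in actions:
            for b in actions:
                gain = 0.0
                for profile, prob in phi.items():
                    if profile[i] != a:
                        continue
                    deviated = profile[:i] + (b,) + profile[i + 1:]
                    gain += prob * (utils[i][deviated] - utils[i][profile])
                if gain > eps + 1e-12:
                    return False
    return True
```

For example, in the textbook game of chicken the uniform distribution over the three profiles other than (dare, dare) passes this check, while a point mass on (chicken, chicken) does not.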


We get a correlated ε-equilibrium if the zero in the above definition is replaced by ε.

To find a CE of a one-shot game, a regret matching algorithm is proposed in [11]. Regret for user i is
defined in terms of the utility $u_i$ and it is assumed that the functional form of the utility is known to user i.
In this paper, we assume that user i is not aware of the functional form of $u_i$ but knows
$w_t^{(i)}(k_t; h_{ii}(t))$ for each t at the end of slot t. We define regret in terms of
$w_t^{(i)}(k_t; H_{ii}(t))$ and this definition is equivalent to the definition of regret in [11] because


u 


k^-'^eA-i 


Y *))limsup;^ Y 

, .. T—^oo , .. 

k( t=iAt ''AkEi) 

limsup- Y 


(7) 
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kEi)^A-i 

T 


t=l 


lim sup - Y (kt, Hift)). 


T—>-oo 


t=l 


Note that as the first summation in (7) is finite, we can exchange summation and limit to get (8). Thus,

T 


= limsup —V] \wl"\k^''\U^'>-,hii{t)) - ; hift)) 

T^oo 7 L 


(9) 


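where $u_i(k^{(i)}, \tilde{k}^{(i)})$ is the utility that user i would have received if the action $k^{(i)}$ were
replaced by $\tilde{k}^{(i)}$ whenever it is played. Similarly, we write
$w_t^{(i)}(k^{(i)}, \tilde{k}^{(i)}; H_{ii}(t))$ to denote the reward of user i by playing $\tilde{k}^{(i)}$
instead of $k^{(i)}$. We use the difference
$X_t(\tilde{k}^{(i)}) = w_t^{(i)}(k^{(i)}, \tilde{k}^{(i)}; H_{ii}(t)) - w_t^{(i)}(k^{(i)}; H_{ii}(t))$ to
define regret.

Thus the regret-matching algorithm in [11] can be described as follows. The regret is

$$ R_T\big(k^{(i)}, \tilde{k}^{(i)}\big) = \max\Big\{ 0, \frac{1}{T} \sum_{t=1}^{T} A_t\big(k^{(i)}, \tilde{k}^{(i)}\big) \Big\}, \tag{10} $$

where

$$ A_t\big(k^{(i)}, \tilde{k}^{(i)}\big) = \begin{cases} X_t\big(\tilde{k}^{(i)}\big), & \text{if } k_t^{(i)} = k^{(i)}, \\ 0, & \text{else}, \end{cases} \tag{11} $$

$$ w_t^{(i)}\big(k^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)\big) = \begin{cases} 1, & \text{if } I_t^{(i)} \le \gamma^{(i)}\big(\tilde{k}^{(i)}; h_{ii}(t)\big), \\ 0, & \text{else}, \end{cases} \tag{12} $$

and $I_t^{(i)}$ denotes the actual interference experienced by user i at time t and
$\gamma^{(i)}(k^{(i)}; h_{ii}(t))$ is the upper bound on interference for successful transmission of user i,
given by

$$ \gamma^{(i)}\big(k^{(i)}; h_{ii}(t)\big) = \frac{|h_{ii}(t)|^2\, k^{(i)}\big(h_{ii}(t)\big)}{2^{r_i} - 1} - 1. $$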


For every pair of pure strategies $k^{(i)}, \tilde{k}^{(i)} \in \mathcal{A}^{(i)}$, the regret
$R_T(k^{(i)}, \tilde{k}^{(i)})$ of a user is a nonnegative real number that reflects the change in utility received
by the user if the choice of pure strategy $k^{(i)}$ is replaced with $\tilde{k}^{(i)}$ at every time instance
that the user chose to play $k^{(i)}$ up to time T. User i chooses a pure strategy according to a probability
distribution in which the probability of choosing an action is proportional to its regret. Thus, if $k^{(i)}$ is
the action played at time T, user i chooses a pure strategy according to the distribution

$$ \phi_{T+1}^{(i)}\big(\tilde{k}^{(i)}\big) = \begin{cases}
\frac{1}{\mu} R_T\big(k^{(i)}, \tilde{k}^{(i)}\big), & \text{for } \tilde{k}^{(i)} \neq k^{(i)}, \\
1 - \sum_{k' \neq k^{(i)}} \frac{1}{\mu} R_T\big(k^{(i)}, k'\big), & \text{if } \tilde{k}^{(i)} = k^{(i)},
\end{cases} \tag{13} $$

for a sufficiently large $\mu$.

It is shown in [11] that following the above procedure, the empirical frequencies of actions converge to
the set of correlated equilibria.

To implement this algorithm, each user i needs to know not only its own actions (transmit powers) but
also the other users’ actions and its cross link channel gains (to compute regret), which it does not know.
Thus, in the next section we modify this algorithm to find a correlated equilibrium where each user updates
its strategy based on the rewards it received and the actions it chose in the past.









III. Learning Algorithm to find a Correlated Equilibrium 


The learning algorithm we propose is fully distributed in the sense that every user updates its strategy
based on its own actions and rewards and independently of the other users’ strategies and rewards. We will
show that the joint empirical distribution converges to the set of correlated equilibria with probability 1 for
our algorithm.

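In our problem, as the transmitter is not aware of the interference at the corresponding receiver, it
cannot find the regrets as in (10). Therefore, each transmitter estimates regret by estimating the instantaneous
reward based on the feedback it has received. The estimated reward is a function of the strategy
$\tilde{k}^{(i)}$ with respect to which we want to find the regret for not using $\tilde{k}^{(i)}$ instead of
$k_t^{(i)}$.

If $k_t^{(i)}$ is the pure strategy that is actually chosen at a time t and $h_{ii}(t)$ is the direct link channel
gain at that time, then the actual interference perceived by its receiver is less than the threshold
$\gamma^{(i)}(k_t^{(i)}; h_{ii}(t))$ whenever the communication is successful and it is greater than
$\gamma^{(i)}(k_t^{(i)}; h_{ii}(t))$ whenever the communication is a failure. User i is optimistic in estimating
the rewards for using $\tilde{k}^{(i)}$ instead of $k_t^{(i)}$. To formally define the estimated reward we use the
following notation. For each $k_t^{(i)} \in \mathcal{A}^{(i)}$ and $h_{ii}(t) \in \mathcal{H}_d^{(i)}$,

$$ S_1^{(i)}(t) = \big\{ \tilde{k}^{(i)} \in \mathcal{A}^{(i)} : \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t)) > \gamma^{(i)}(k_t^{(i)}; h_{ii}(t)) \big\}, $$
$$ S_2^{(i)}(t) = \big\{ \tilde{k}^{(i)} \in \mathcal{A}^{(i)} : \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t)) < \gamma^{(i)}(k_t^{(i)}; h_{ii}(t)) \big\}, $$
$$ S_3^{(i)}(t) = \big\{ \tilde{k}^{(i)} \in \mathcal{A}^{(i)} : \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t)) = \gamma^{(i)}(k_t^{(i)}; h_{ii}(t)) \big\}. $$

User i finds the instantaneous reward $\hat{w}_t^{(i)}(k_t^{(i)}, \tilde{k}^{(i)}; h_{ii}(t))$ that it could have
received if it had used action $\tilde{k}^{(i)}$ instead of $k_t^{(i)}$ at time t, as

$$ \hat{w}_t^{(i)}\big(k_t^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)\big) = \begin{cases}
1, & \text{if } w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 1, \ \tilde{k}^{(i)} \in S_1^{(i)}(t), \\
1, & \text{if } w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 0, \ \tilde{k}^{(i)} \in S_1^{(i)}(t), \\
1, & \text{if } w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 1, \ \tilde{k}^{(i)} \in S_2^{(i)}(t) \cup S_3^{(i)}(t), \\
0, & \text{if } w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 0, \ \tilde{k}^{(i)} \in S_2^{(i)}(t) \cup S_3^{(i)}(t).
\end{cases} \tag{14} $$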


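For every pair of actions $k^{(i)}$ and $\tilde{k}^{(i)}$, after T slots, the estimated regret
$\hat{R}_T(k^{(i)}, \tilde{k}^{(i)})$ is

$$ \hat{R}_T\big(k^{(i)}, \tilde{k}^{(i)}\big) = \max\Big\{ 0, \frac{1}{T} \sum_{t=1}^{T} \hat{A}_t\big(k^{(i)}, \tilde{k}^{(i)}\big) \Big\}, \tag{15} $$

where

$$ \hat{A}_t\big(k^{(i)}, \tilde{k}^{(i)}\big) = \begin{cases} \hat{X}_t\big(\tilde{k}^{(i)}\big), & \text{if } k_t^{(i)} = k^{(i)}, \\ 0, & \text{else}; \end{cases} \tag{16} $$

$$ \hat{X}_t\big(\tilde{k}^{(i)}\big) = \hat{w}_t^{(i)}\big(k_t^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)\big) - w_t^{(i)}\big(k_t^{(i)}; h_{ii}(t)\big), \tag{17} $$

and $\hat{w}_t^{(i)}(k_t^{(i)}, k_t^{(i)}; h_{ii}(t)) = w_t^{(i)}(k_t^{(i)}; h_{ii}(t))$, the actual reward received
by the user. If $k_T^{(i)} = k^{(i)}$, i.e., $k^{(i)}$ is the action chosen by user i at time instant T, then an
action $\tilde{k}^{(i)}$ in time slot T + 1 is chosen with probability

$$ \phi_{T+1}^{(i)}\big(\tilde{k}^{(i)}\big) = \begin{cases}
\frac{1}{\mu} \hat{R}_T\big(k^{(i)}, \tilde{k}^{(i)}\big), & \text{for } \tilde{k}^{(i)} \neq k^{(i)}, \\
1 - \sum_{k' \neq k^{(i)}} \frac{1}{\mu} \hat{R}_T\big(k^{(i)}, k'\big), & \text{if } \tilde{k}^{(i)} = k^{(i)}.
\end{cases} \tag{18} $$

It should be noted that the quantities $\hat{w}_t^{(i)}$ and $\hat{R}_T$ are the estimated values of reward and regret.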
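A compact sketch of one user's side of this procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the class and function names are hypothetical, actions are indexed 0, ..., n-1, the caller supplies the interference threshold of every action at the current direct link gain, and `mu` is assumed to be larger than n-1 so that the probabilities stay in [0, 1].

```python
def optimistic_reward(ack, thr_played, thr_alt):
    """Estimated reward of an alternative action: 1 if it tolerates strictly
    more interference than the played action, else the observed ACK bit."""
    return 1 if thr_alt > thr_played else int(ack)

class RegretMatcher:
    def __init__(self, n_actions, mu):
        self.n, self.mu, self.T = n_actions, mu, 0
        # sums[k][alt] accumulates the estimated gain of alt over k
        self.sums = [[0.0] * n_actions for _ in range(n_actions)]

    def update(self, played, ack, thresholds):
        """Record one slot: the played action, its ACK bit, and thresholds[a]
        for every action a at the current direct link gain."""
        self.T += 1
        for alt in range(self.n):
            est = optimistic_reward(ack, thresholds[played], thresholds[alt])
            self.sums[played][alt] += est - int(ack)

    def regret(self, k, alt):
        return max(0.0, self.sums[k][alt] / self.T) if self.T else 0.0

    def next_distribution(self, last_played):
        """Play alt != last_played with probability regret/mu, stay otherwise."""
        probs = [self.regret(last_played, a) / self.mu for a in range(self.n)]
        probs[last_played] = 0.0
        probs[last_played] = 1.0 - sum(probs)
        return probs
```

Only the user's own action, its ACK/NACK bit and its direct link gain (through the thresholds) enter the update, which is what makes the scheme fully distributed.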


We define the empirical frequency of strategies chosen up to time T as

$$ f_T(k) = \frac{1}{T} \big| \{ t \le T : k_t = k \} \big|, \quad k \in \mathcal{A}. \tag{19} $$


It is shown in [11] that the empirical frequency of strategies converges to the set of correlated ε-equilibria
if and only if the actual regret $R_T(k^{(i)}, \tilde{k}^{(i)})$ converges to zero as $T \to \infty$. This can be
formally stated as

Proposition 1. Let $\{k_t\}$, $t = 1, 2, \dots$ be a sequence of actions chosen by the users. For any ε > 0,
$\limsup_{T \to \infty} R_T(k^{(i)}, \tilde{k}^{(i)}) \le \epsilon$ for each user i and every
$k^{(i)}, \tilde{k}^{(i)} \in \mathcal{A}^{(i)}$ with $k^{(i)} \neq \tilde{k}^{(i)}$, if and only if the
sequence of empirical frequencies $f_T$ converges to the set of correlated ε-equilibria almost surely. □

In Proposition 2 we prove that if the estimated regret $\hat{R}_T(k^{(i)}, \tilde{k}^{(i)})$ converges to zero
then the actual regret $R_T(k^{(i)}, \tilde{k}^{(i)})$ also converges to zero.

Proposition 2. Let $\{k_t\}$, $t = 1, 2, \dots$ be a sequence of actions chosen by the users. For each user i and
every $k^{(i)}, \tilde{k}^{(i)} \in \mathcal{A}^{(i)}$ with $k^{(i)} \neq \tilde{k}^{(i)}$, if
$\lim_{T \to \infty} \hat{R}_T(k^{(i)}, \tilde{k}^{(i)}) = 0$ then
$\lim_{T \to \infty} R_T(k^{(i)}, \tilde{k}^{(i)}) = 0$.






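Proof: To prove the proposition, we consider all t such that $k_t^{(i)} = k^{(i)}$ and prove that

$$ \hat{w}_t^{(i)}\big(k^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)\big) \ge w_t^{(i)}\big(\tilde{k}^{(i)}; h_{ii}(t)\big), \tag{20} $$

for any given $k^{(i)}, \tilde{k}^{(i)} \in \mathcal{A}^{(i)}$, $k^{(i)} \neq \tilde{k}^{(i)}$, and given channel
gain $h_{ii}(t)$. Here, we note that if $\gamma^{(i)}(k^{(i)}; h_{ii}(t)) = \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$
then $\hat{w}_t^{(i)}(k^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)) = w_t^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$. Therefore (20)
is satisfied and hence in the following we consider only the cases where
$\gamma^{(i)}(k^{(i)}; h_{ii}(t)) \neq \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$.

We now consider two cases separately:

Case 1: $\gamma^{(i)}(k_t^{(i)}; h_{ii}(t)) < \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$.

In this case, it should be noted that $\hat{w}_t^{(i)}(k^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)) = 1$ and, as
$w_t^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$ can be either 0 or 1, (20) always holds.

Case 2: $\gamma^{(i)}(k_t^{(i)}; h_{ii}(t)) > \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$.

In this case, if $w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 1$, then by definition
$\hat{w}_t^{(i)}(k^{(i)}, \tilde{k}^{(i)}; h_{ii}(t)) = 1$ and (20) always holds.
If $w_t^{(i)}(k_t^{(i)}; h_{ii}(t)) = 0$, then $I_t^{(i)} > \gamma^{(i)}(k_t^{(i)}; h_{ii}(t))$ and hence
$I_t^{(i)} > \gamma^{(i)}(\tilde{k}^{(i)}; h_{ii}(t))$. Therefore we have
$w_t^{(i)}(\tilde{k}^{(i)}; h_{ii}(t)) = 0$ and (20) is satisfied with equality.

Hence, (20) always holds and we have $\hat{A}_t(k^{(i)}, \tilde{k}^{(i)}) \ge A_t(k^{(i)}, \tilde{k}^{(i)})$.
Therefore, $0 \le R_T(k^{(i)}, \tilde{k}^{(i)}) \le \hat{R}_T(k^{(i)}, \tilde{k}^{(i)})$ and, if
$\hat{R}_T(k^{(i)}, \tilde{k}^{(i)})$ converges to zero as T approaches infinity, then
$R_T(k^{(i)}, \tilde{k}^{(i)})$ also converges to zero. ■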

In [30], the authors have extended the regret-matching algorithm of [11] so that one can use a function of
the regret in the original procedure instead of the regret itself, where the function satisfies certain conditions. We
cannot use that result here, as our estimation does not satisfy the conditions on the function. But we can
generalize the result in [11].

Theorem 1. Let the actual regret be defined as in (fTOj). Let the actual utility w'f\k^^'^]Hii{t)) in t l7^) be 
replaced by an estimated utility w^\k^^\k^'^^] Hifit)) such that 


Then following the regret-matching algorithm T7^-TOI) with the actual regret RT{k^''\ replaced by the 


estimated regret RT{k^^\ kk'>), the estimated regret RT{k^'^\ converges to zero as T approaches infinity. 

Following the proof of the main theorem in [11], we can show that for each $k^{(i)}$ and $\tilde{k}^{(i)}$ the
estimated regret $\hat{R}_T(k^{(i)}, \tilde{k}^{(i)})$ converges to zero as T approaches infinity. Therefore by
Proposition 2 we get the following theorem.

Theorem 2. If each user chooses strategies in each time slot according to the algorithm tl77l)-tl7^. then
the empirical frequency $f_T$ converges to the set of correlated ε-equilibria for any ε > 0.

In the proof of the main theorem of [11], the history up to time T is defined as the actions chosen by all
users at time instances t = 1,... ,T. To prove that the estimated regret converges to zero following the
regret-matching algorithm, we just need to redefine the history up to time T as the actions chosen by all
users along with the direct channel states at time instances t = 1,... ,T. With this definition of history,
the entire proof of the main theorem in [11] carries over and we can conclude that the estimated regret
converges to zero.

The performance of the system at a CE may not be very satisfactory from the overall system point of
view. Therefore, we also provide a distributed algorithm in Section V which achieves a Pareto point.
The Pareto points are socially optimal.

IV. Learning Coarse Correlated Equilibrium 

In this section, we compute a coarse correlated equilibrium, which is a generalization of a correlated
equilibrium. We present the multiplicative weight (MW) algorithm [[^ , [34] to compute a CCE of our
power allocation game. The MW algorithm has much lower computational complexity per iteration than
the regret matching algorithm presented in Section III. It also does not require the estimation of regret needed
in Section III. Also, it has been observed that the price of anarchy (POA) of a CCE is no worse than that
of a CE in a large class of games If^ . However, it is also known that for some other classes of games,
e.g., congestion games, the POA of CCE/CE can be larger compared to NE.

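From the definition of CE, condition (6) requires that every user maximizes the conditional expectation
of utility where the conditioning is on $\phi$ and $k^{(i)}$. In CCE, user i contemplates a deviation knowing
only the distribution $\phi$.

Let $c^{(i)}(k)$ be the cost of user i; each user chooses its action to minimize the cost. In our power
allocation problem, we can define cost as the negative of the utility, i.e., $c^{(i)}(k) = -u_i(k)$. We
define the CCE of a cost minimization game as

Definition 2. A distribution $\phi$ on $\mathcal{A}$ is said to be a coarse correlated equilibrium if

$$ \mathbb{E}_{k \sim \phi}\big[ C^{(i)}(k) \big] \le \mathbb{E}_{k \sim \phi}\big[ C^{(i)}\big(\tilde{k}^{(i)}, k^{(-i)}\big) \big] \tag{21} $$

for each user $i \in \{1, 2, \dots, N\}$, and for all actions $\tilde{k}^{(i)} \in \mathcal{A}^{(i)}$. The
distribution $\phi$ is called an ε-coarse correlated equilibrium if

$$ \mathbb{E}_{k \sim \phi}\big[ C^{(i)}(k) \big] \le \mathbb{E}_{k \sim \phi}\big[ C^{(i)}\big(\tilde{k}^{(i)}, k^{(-i)}\big) \big] + \epsilon \tag{22} $$

for each user $i \in \{1, 2, \dots, N\}$, for every action $\tilde{k}^{(i)} \in \mathcal{A}^{(i)}$.

Please note that whenever the cost is a random variable, we denote it by $C^{(i)}$ rather than $c^{(i)}$. In this
definition, cost is a random variable that depends on the randomly chosen actions.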
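The coarse correlated equilibrium condition compares the expected cost under the joint distribution with the expected cost of committing to a single fixed action. A small sketch of that check follows; the dictionary-based data layout and the name `is_coarse_correlated_eq` are assumptions made for illustration.

```python
def is_coarse_correlated_eq(phi, costs, action_sets, eps=0.0):
    """phi: dict mapping action profiles (tuples) to probabilities.
    costs[i]: dict mapping every action profile to the cost of player i.
    Returns True if no player lowers its expected cost by more than eps
    by committing to one fixed action before the profile is drawn."""
    for i, actions in enumerate(action_sets):
        expected = sum(prob * costs[i][prof] for prof, prob in phi.items())
        for b in actions:
            deviated = sum(
                prob * costs[i][prof[:i] + (b,) + prof[i + 1:]]
                for prof, prob in phi.items()
            )
            if expected > deviated + eps + 1e-12:
                return False
    return True
```

Unlike the correlated equilibrium check, the deviation here does not condition on the recommended action, which is why every CE also passes this test.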

Every CE is also a CCE and thus the set of CCE is a larger set than the set of CE. There exist no-regret
learning algorithms to compute a CCE, but the notion of regret used to compute a CCE is different from
that used to compute a CE. The regret defined in Section II is known as internal regret; to compute a CCE
we use external regret, which is defined as

Definition 3. The regret of user i given the pure strategy sequence $k_1, \dots, k_T$ with respect to an action
$\tilde{k}^{(i)}$ is

$$ \frac{1}{T} \Big[ \sum_{t=1}^{T} C_t^{(i)}\big(k_t^{(i)}, k_t^{(-i)}\big) - \sum_{t=1}^{T} C_t^{(i)}\big(\tilde{k}^{(i)}, k_t^{(-i)}\big) \Big]. \tag{23} $$


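An algorithm in which users update their strategies based on the received cost in such a way that the
external regret converges to zero is a no-regret algorithm. We now present a no-regret algorithm known as
the multiplicative weight algorithm to compute a CCE.

In the initial iteration t = 1, each user assigns a weight $q_1^{(i)}(k^{(i)}) = 1$ to each action
$k^{(i)} \in \mathcal{A}^{(i)}$. User i chooses an action $k^{(i)}$ with probability

$$ \phi_t^{(i)}\big(k^{(i)}\big) = \frac{q_t^{(i)}\big(k^{(i)}\big)}{Q_t^{(i)}}, \qquad Q_t^{(i)} = \sum_{k' \in \mathcal{A}^{(i)}} q_t^{(i)}(k'). \tag{24} $$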

During the iteration t, if is the action chosen by user i in iteration t, then it receives the expected utility 

Based on the received utility, user i updates the weight 

of action , as 

(25) 


For iteration t + 1, user i chooses an action according to (24) with the updated weights, and
this process is repeated. We have the following convergence result.

Theorem 3. Following the multiplicative weight update algorithm, there exists a positive integer T such
that the external regret of user i defined in (23) is less than ε after T iterations. Let
$\phi_t = \prod_{i=1}^{N} \phi_t^{(i)}$ denote the outcome distribution at time t and
$\phi = \frac{1}{T} \sum_{t=1}^{T} \phi_t$. Then $\phi$ is an ε-coarse correlated equilibrium.
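The multiplicative weight procedure can be sketched in a few lines. This illustration assumes the common textbook update, in which each weight is multiplied by (1 - eta) raised to the action's cost, with costs rescaled to [0, 1] (for instance one minus the estimated success probability); the exact update used in the paper may differ, and all names below are hypothetical.

```python
import random

def mw_probabilities(weights):
    """Play each action with probability proportional to its weight."""
    total = sum(weights)
    return [w / total for w in weights]

def mw_update(weights, costs, eta):
    """One multiplicative-weight step; costs[a] in [0, 1] (assumed scaling)."""
    return [w * (1.0 - eta) ** c for w, c in zip(weights, costs)]

def run_mw(cost_fn, n_actions, rounds, eta, seed=0):
    """cost_fn(t, action) -> cost in [0, 1]. Returns played actions and final weights."""
    rng = random.Random(seed)
    weights = [1.0] * n_actions
    played = []
    for t in range(rounds):
        probs = mw_probabilities(weights)
        played.append(rng.choices(range(n_actions), weights=probs)[0])
        weights = mw_update(weights, [cost_fn(t, a) for a in range(n_actions)], eta)
    return played, weights
```

Actions that keep incurring cost lose weight geometrically, so the sampling distribution concentrates on actions with low cumulative cost.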

We use this MW algorithm to find a CCE of our power allocation game. In general to use the MW 
algorithm user i needs to know the expected utility. In our game user i finds it given the history of actions 
and rewards as 

t=l 

Based on $u_t^{(i)}(k^{(i)})$, user i updates its weights using the MW update and chooses an action according
to (24) with $c_t^{(i)}(k^{(i)}) = -u_t^{(i)}(k^{(i)})$. Unlike in the algorithm for CE, we do not need to evaluate
the estimated reward, as the MW algorithm does not explicitly depend on the regret defined in (23). But the MW
algorithm guarantees that the external regret converges to zero. Hence we can apply the MW algorithm to our
problem to find a CCE.



V. Pareto Optimal Points 


Definition 4. An action profile $k \in \mathcal{A}$ is Pareto optimal if there does not exist another action
profile $k' \in \mathcal{A}$ such that $u_i(k') \ge u_i(k)$ for all $i = 1, \dots, N$ with at least one strict
inequality.

In this section we present a distributed algorithm to find a Pareto optimal point.

The global maximum of

$$ W(k) = \sum_{i=1}^{N} \alpha_i u_i(k), \quad \text{subject to } k \in \mathcal{A}, \tag{26} $$

is a Pareto optimal solution, where $\alpha_i$ are positive constants [|2^ . We find Pareto points by finding a
solution of (26).

We assume that when a receiver sends an ACK/NACK to its transmitter, all the other transmitters can 
also listen to it without error. This is realistic in many wireless systems because an ACK/NACK message 
is small and is usually transmitted at low rates with very low probability of error. Under this assumption, 
we present a learning algorithm in which users may or may not choose to experiment and update their 
strategies in such a way that improves W{k). 

The algorithm is as follows: 

• Each user i chooses a random action $k^{(i)}$ uniformly from $\mathcal{A}^{(i)}$. All the users use these randomly
chosen actions for a fixed number T of time slots. Each user i = 1,... , N follows the procedure
below sequentially:

• As user i receives the feedback of other users, it finds the weighted sum W(k) of the utilities

$$ W(k) = \sum_{i=1}^{N} \alpha_i \Big( \frac{1}{T} \sum_{t=1}^{T} w_t^{(i)}\big(k; h_{ii}(t)\big) \Big). \tag{27} $$

At the end of T slots, user i experiments with probability δ. When user i experiments, with probability
ε it chooses an action uniformly at random from $\mathcal{A}^{(i)}$ other than $k^{(i)}$, and with probability
1 − ε it chooses an action other than $k^{(i)}$ in the following way:

o In the action $k^{(i)}$, a power level has been specified for each value of direct link channel gain. User i
chooses an action randomly from the subset of $\mathcal{A}^{(i)}$ of feasible actions having a higher power level
than $k^{(i)}$ for the channel state with the highest probability of occurrence. If this subset is empty, then
it chooses an action with a higher power level for the channel gain with the second highest probability
of occurrence.

o If all the direct link channel gains occur with equal probability, then user i chooses an action
randomly from the subset of $\mathcal{A}^{(i)}$ of feasible actions having a higher power level for the maximum
value of direct link channel gain. If this subset is empty, it chooses an action with a higher power
level for the second maximum direct link channel gain.

• Let this new action be $\tilde{k}^{(i)}$. For the next T time slots, user i uses action $\tilde{k}^{(i)}$ and
user j uses action $k^{(j)}$ for $j \neq i$. User i finds the weighted sum of the utilities of all the users,
$W(\tilde{k}^{(i)}, k^{(-i)})$. If $W(\tilde{k}^{(i)}, k^{(-i)}) > W(k^{(i)}, k^{(-i)})$, then user i replaces its
action $k^{(i)}$ with $\tilde{k}^{(i)}$ and this new weighted sum of the average
rewards is taken as a benchmark. If there is no improvement in the weighted sum of the average
rewards, it randomly selects another action following the procedure described above. Thus each user
may experiment with up to a maximum of MAX actions chosen randomly.
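The steps above can be sketched as a stochastic local search. This is a simplified illustration with several assumptions: the weighted sum `W` is supplied as a function of the action profile (in the paper it is estimated from overheard ACK/NACK feedback over T slots), each action is a tuple of power levels indexed by channel gain, `gain_orders[i]` lists user i's gain indices in the order the local search should try them, and when no local candidate exists the sketch falls back to a uniform choice. All names are hypothetical.

```python
import random

def local_candidates(current, actions, gain_order):
    """Actions with strictly higher power than `current` at the first gain
    index (following `gain_order`) for which such actions exist."""
    for g in gain_order:
        cands = [a for a in actions if a != current and a[g] > current[g]]
        if cands:
            return cands
    return []

def experiment(current, actions, gain_order, eps, rng):
    """Pick a trial action: uniform with probability eps, local search otherwise."""
    others = [a for a in actions if a != current]
    if not others:
        return current
    if rng.random() < eps:
        return rng.choice(others)
    cands = local_candidates(current, actions, gain_order)
    return rng.choice(cands or others)  # fallback to uniform is an assumption

def pareto_search(action_sets, gain_orders, W, delta, eps, max_tries, rounds, seed=0):
    rng = random.Random(seed)
    profile = [rng.choice(acts) for acts in action_sets]
    best = W(tuple(profile))  # benchmark value of the weighted sum
    for _ in range(rounds):
        for i, acts in enumerate(action_sets):  # users act sequentially
            if rng.random() >= delta:
                continue  # user i does not experiment this round
            for _ in range(max_tries):
                trial = list(profile)
                trial[i] = experiment(profile[i], acts, gain_orders[i], eps, rng)
                value = W(tuple(trial))
                if value > best:  # keep the new action only if W improves
                    profile, best = trial, value
                    break
    return tuple(profile), best
```

Because an action is replaced only when the weighted sum strictly improves, the benchmark value never decreases over the course of the search.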

If ε = 1, each user experiments with randomly chosen actions. But, for small ε, our algorithm
selects an action from the action space of that user by a local search. The local search often yields
a better point, one that improves W(k), than a purely random search in the entire action space, and yields a
faster rate of convergence, as seen in our numerical examples.

A user updates its action whenever there is an improvement in the weighted sum of average rewards over
the benchmark. Hence, this benchmark of utility is monotonically increasing and bounded above by
$\sum_i \alpha_i$. Therefore, for a sufficiently large MAX, we find a Pareto optimal point with a large
probability. By increasing MAX, this probability can be made arbitrarily close to 1.

Our algorithm is a distributed version of a meta heuristic, stochastic local search OTI . often used for 


global optimization. 


We can also obtain a Pareto point which satisfies a certain minimum probability of success (e.g., for voice
users) by including this constraint in the set $\mathcal{A}$. Pareto points, although they globally maximize W(k), may
be unfair to some users. Changing the weights $\alpha_i$ can alleviate some unfairness. Otherwise, one can obtain
Pareto points which are Nash bargaining solutions [29], which can be obtained via a similar algorithm, as
explained in the next section.

VI. Nash Bargaining 

In Nash bargaining, we specify a disagreement outcome that specifies utility of each user that it receives 
by playing the disagreement strategy whenever there is no incentive to play the bargaining outcome. Thus, 
by choosing the disagreement outcomes appropriately, the users can ensure certain fairness. 

The Nash bargaining solutions are Pareto optimal and also satisfy certain natural axioms [36]. It is
shown in ll3^ that for a two player game, there exists a unique bargaining solution (if the feasible region is 
nonempty) that satisfies the axioms stated above and it is given by the solution of the optimization problem 

\[
\begin{aligned}
&\text{maximize} && (s_1 - d_1)(s_2 - d_2), \\
&\text{subject to} && s_i \geq d_i,\ i = 1, 2, \quad (s_1, s_2) \in S.
\end{aligned}
\tag{28}
\]

For an N-user Nash bargaining problem, this result can be extended and the solution of an N-user bargaining 
problem is the solution of the optimization problem 

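\[
\begin{aligned}
&\text{maximize} && \prod_{i=1}^{N} (s_i - d_i), \\
&\text{subject to} && s_i \geq d_i,\ i = 1, \ldots, N, \quad (s_1, \ldots, s_N) \in S.
\end{aligned}
\tag{29}
\]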
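
As a small illustration of the N-user problem, the following Python sketch maximizes the Nash product over a finite list of candidate utility vectors. This is only a sketch under assumptions: the function name `nash_bargaining`, the candidate vectors and the disagreement point are hypothetical, and a finite candidate list stands in for the feasible set $S$.

```python
from math import prod


def nash_bargaining(candidates, d):
    """Return the candidate maximizing prod_i (s_i - d_i) subject to s_i >= d_i."""
    feasible = [s for s in candidates
                if all(si >= di for si, di in zip(s, d))]
    if not feasible:
        return None  # empty feasible region: no bargaining solution
    return max(feasible, key=lambda s: prod(si - di for si, di in zip(s, d)))


# Hypothetical two-user example with disagreement point d = (0.1, 0.1).
candidates = [(0.9, 0.2), (0.6, 0.5), (0.3, 0.7), (0.9, 0.05)]
solution = nash_bargaining(candidates, d=(0.1, 0.1))
```

Here the Nash products of the three feasible candidates are approximately 0.08, 0.20 and 0.12, so the balanced vector (0.6, 0.5) is selected, while (0.9, 0.05) is discarded because it violates $s_2 \geq d_2$.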

A Nash bargaining solution is also related to proportional fairness, another fairness concept commonly
used in communication literature. A utility vector $s^* \in S$ is said to be proportionally fair if, for any other
feasible vector $s \in S$, the aggregate proportional change $\sum_k (s_k - s_k^*)/s_k^*$ is non-positive [17]. If the
set $S$ is convex, then Nash bargaining and proportional fairness are equivalent [37]. Proportional fairness
is studied in [1^ when $S$ is non-convex. In our case, $S$ is convex and hence the Nash bargaining solution is
also proportionally fair.
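
The link between the two notions can be checked numerically on a toy convex set. The sketch below is an illustration only: the set $\{s \geq 0 : s_1 + s_2 \leq 1\}$, its grid discretization, the zero disagreement point and the function name are all assumptions made for this example.

```python
def aggregate_proportional_change(s_star, s):
    """Sum over users of (s_k - s*_k) / s*_k."""
    return sum((sk - sk_star) / sk_star for sk, sk_star in zip(s, s_star))


# Grid over the toy convex set {s >= 0 : s_1 + s_2 <= 1}.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11) if i + j <= 10]

# Nash bargaining point with zero disagreement outcome: maximize s_1 * s_2.
s_star = max(grid, key=lambda s: s[0] * s[1])

# Largest aggregate proportional change relative to s_star over the grid.
worst = max(aggregate_proportional_change(s_star, s) for s in grid)
```

On this set the maximizer of the Nash product is the symmetric point, and the aggregate proportional change relative to it equals $2(s_1 + s_2) - 2$, which is never positive on the set.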


A major problem in finding a solution of a bargaining problem is choosing the disagreement outcome.
It is more common to consider an equilibrium point as a disagreement outcome. In our problem we can
consider the utility vector at a CE as the disagreement outcome. We can also choose $d_i = 0$ for each $i$. If
we choose the disagreement outcome to be a CE, each user needs to evaluate a CE first before running
the algorithm to find a solution of (29), which requires more computations. Instead, we can choose the
disagreement outcome to be the zero vector, or obtain it by the following procedure:

• Each user chooses an action that gives a higher power level to the channel state that has a higher probability
of occurrence. In other words, among the set of feasible actions, it chooses the subset of pure strategies that
give the highest power level to the channel state with the highest probability of occurrence. We shrink
the subset by considering the actions that give a higher power level to the second most frequently occurring
channel state, and we repeat this process until we get a single strategy.

• If all the channel states occur with equal probability, we follow the above procedure by considering 
the value of the channel gain instead of the probabilities of occurrence of the channel gains. 
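
The two-step procedure above amounts to a lexicographic selection, which can be sketched in Python as follows. This is a hedged illustration: the function name, the representation of an action as a tuple of power levels (one per channel state), and the example values are assumptions, and ties among only some of the probabilities are broken by state index, a case the text does not specify.

```python
def disagreement_action(feasible_actions, state_probs, state_gains):
    """Pick the feasible action that is lexicographically largest when channel
    states are ranked by probability (or by gain if all states are equiprobable).
    """
    n = len(state_probs)
    # If every state is equally likely, rank states by channel gain instead.
    key = state_gains if len(set(state_probs)) == 1 else state_probs
    order = sorted(range(n), key=lambda l: key[l], reverse=True)
    # Highest power on the top-ranked state first, then the next state, and so on.
    return max(feasible_actions, key=lambda a: tuple(a[l] for l in order))


# Hypothetical example: two channel states, actions are (power in state 0, power in state 1).
feasible = [(0, 0), (5, 0), (10, 0), (0, 5), (5, 5)]
chosen = disagreement_action(feasible, state_probs=(0.3, 0.7), state_gains=(0.2, 1.0))
chosen_equal = disagreement_action(feasible, state_probs=(0.5, 0.5), state_gains=(1.0, 0.2))
```

With probabilities (0.3, 0.7), state 1 is ranked first, so the actions with power 5 in state 1 are kept and the one with the larger power in state 0 is chosen. With equal probabilities the gains decide the ranking, so the action with the largest power in state 0 is chosen.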

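Assume that each user $i$ uses the pure strategy chosen in this way for a fixed number $T_d$ of slots.
User $i$ finds $d_i$ by averaging the rewards received in the $T_d$ slots, i.e.,

\[
d_i = \frac{1}{T_d} \sum_{t=1}^{T_d} w_i(t),
\tag{30}
\]

where $w_i(t)$ denotes the reward received by user $i$ in slot $t$.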

For our numerical evaluations we have chosen the disagreement outcome following the procedure
described above instead of choosing the zero vector. To find the bargaining solution, i.e., to solve the
optimization problem (29), we use the algorithm of Section V used to find a Pareto optimal point but
with objective $W(k)$ defined as

\[
W(k) = \prod_{i=1}^{N} (u_i(k) - d_i).
\]


In Section VIII we present a Nash bargaining solution for the numerical examples we consider, and observe
that the Nash bargaining solution obtained is a Pareto optimal point which provides fairness among the
users.




VII. Transmission at Multiple Rates 


Until now, we have presented learning algorithms to compute a CE, a CCE, Pareto points and Nash
bargaining solutions when a user is transmitting at a fixed rate. In this section, we generalize the model
so that a user can transmit at multiple rates rather than at a fixed rate, and show that we can still use the
same algorithms to compute equilibria.

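Let $\mathcal{R}^{(i)} = \{r_1^{(i)}, \ldots, r_\kappa^{(i)}\}$ be the set of possible transmission rates of user $i$. Let $\{p_1^{(i)}, \ldots, p_M^{(i)}\}$ be the
set of power levels for user $i$, as considered earlier. We denote the new strategy set as

\[
\mathcal{A}_r^{(i)} = \Big\{ \big(r^{(i)}, P_1^{(i)}, \ldots, P_n^{(i)}\big) \,\Big|\, r^{(i)} \in \mathcal{R}^{(i)},\ P_l^{(i)} \in \{p_1^{(i)}, \ldots, p_M^{(i)}\},\ \sum_{l=1}^{n} \pi^{(i)}(l)\, P_l^{(i)} \leq \bar{P}^{(i)} \Big\},
\tag{31}
\]

where $n$ is the number of direct link channel states of user $i$, $\pi^{(i)}(l)$ is the probability of state $l$, and $\bar{P}^{(i)}$ is the average power constraint.

The cardinality of $\mathcal{A}_r^{(i)}$ is $\kappa$ times that of $\mathcal{A}^{(i)}$, as every action $k^{(i)} \in \mathcal{A}^{(i)}$ can be associated with each rate
in $\mathcal{R}^{(i)}$. We enumerate the elements of $\mathcal{A}_r^{(i)}$ as in Section II. Here also we denote an action by $k^{(i)} \in \mathcal{A}_r^{(i)}$,
and $r(k^{(i)})$ is the rate of transmission under the action $k^{(i)}$. If $H_{ii} \in \mathcal{H}_d^{(i)}$ is the direct link channel gain of
user $i$ and the user chooses action $k^{(i)}$, then it transmits at a rate $r(k^{(i)})$ with power $P_{k^{(i)}}(H_{ii})$.

User $i$ receives an ACK if the interference $I^{(i)}$ at receiver $i$ satisfies

\[
I^{(i)} \leq \frac{H_{ii}\, P_{k^{(i)}}(H_{ii})}{2^{r(k^{(i)})} - 1},
\tag{32}
\]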


and it receives a NACK otherwise. We use the same notation $\Gamma(k^{(i)}, H_{ii})$ to denote the upper bound on
the interference for receiving an ACK. We can redefine the estimated reward and estimated regret as in
(14) and (15) respectively, but with the threshold redefined as

\[
\Gamma(k^{(i)}, H_{ii}) = \frac{H_{ii}\, P_{k^{(i)}}(H_{ii})}{2^{r(k^{(i)})} - 1}.
\tag{33}
\]
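
To make the rate-dependent threshold concrete, here is a minimal Python sketch of the ACK/NACK rule. It assumes the threshold has the form gain times power divided by $2^{r} - 1$, with any noise term absorbed into the interference; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
def interference_threshold(gain, power, rate):
    """Largest interference for which a transmission at `rate` (bits per channel
    use) with the given direct link gain and power is still decodable, under the
    assumed form gain * power / (2**rate - 1)."""
    return gain * power / (2.0 ** rate - 1.0)


def feedback(interference, gain, power, rate):
    """Return 'ACK' if the interference is within the threshold, else 'NACK'."""
    if interference <= interference_threshold(gain, power, rate):
        return "ACK"
    return "NACK"


# Hypothetical values: unit gain, power 10, rate 1 bit per channel use.
threshold = interference_threshold(gain=1.0, power=10.0, rate=1.0)
low_rate = interference_threshold(gain=1.0, power=10.0, rate=0.75)
high_rate = interference_threshold(gain=1.0, power=10.0, rate=1.2)
```

Under this assumed form, a higher rate lowers the tolerable interference for the same gain and power, which is why the choice of rate becomes part of each user's strategy.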


It can easily be seen even in this case that the estimated reward for $(k^{(i)}, H_{ii}(t))$ is greater than or equal
to the actual reward for $(k^{(i)}, H_{ii}(t))$. Hence, we can use the regret-matching algorithm to compute a CE
for the game with the enlarged strategy set (31). We can also use the respective algorithms mentioned earlier to
compute Pareto points, Nash bargaining solution, and CCE.

We can ensure that these solutions satisfy certain minimum rates by limiting our overall action space to 
strategies that satisfy these rate constraints. 




VIII. Numerical Examples 


In this section we consider three examples with three transmitter-receiver pairs in the communication
system. In the first example, we consider a symmetric scenario where $\mathcal{H}_d^{(i)} = \{0.2, 0.6, 1\}$ and $\mathcal{H}_c^{(i)} =
\{0.1, 0.3, 0.5\}$ for each user $i$, and each channel state occurs with equal probability. The set of possible
power levels for each player is $\{0, 5, 10, 15, 20, 25, 30\}$. Each user transmits at a rate of $r_i = 0.75$ bits per
channel use, and receives feedback from its receiver. Each user follows the learning algorithm (fT7l) - (fT^ to 
find a CE, and finds a Pareto optimal strategy as described in Section V. In finding the Pareto points, we
choose $\alpha_i = r_i$ for all $i$. The sum rate at a CE and at a Pareto point are compared in Figure 1. We also
compare the sum rate at a CE obtained by using the reinforcement learning (RL) algorithm in [20]. We
observe that the sum rates at CE obtained via our algorithm and that obtained via the algorithm in [20]
almost coincide in this example.

Even though the sum rates are close for both the algorithms, we observe that our algorithm converges
faster than the reinforcement learning (RL) algorithm. In the RL algorithm, it is required that each pure strategy of each user
be played for a minimum number of time slots to find the regret as defined in [20]. Thus the algorithm
requires a larger number of iterations to converge to the set of correlated equilibria. In this example, at SNR
of 15 dB, our algorithm converges in about 200,000 iterations, whereas the algorithm in [20] converges in
about 700,000 iterations.

In finding a Pareto point, if we randomly choose a strategy ($\epsilon = 1$) each time, instead of local search,
the algorithm runs for about 150,000 iterations whereas our local search algorithm finds a Pareto point in
70,000 iterations.

We also plot in Figure 1 the sum rate at a stochastically stable point of the trial and error based algorithm
(TE) in [22]. It is known from [22] that the algorithm therein converges to an efficient NE only if the game
under consideration has at least one pure strategy NE. In general, we cannot guarantee existence of a
pure strategy NE for our game, and hence the stochastically stable point computed by the algorithm in
[22] need not be a NE. Then the algorithm produces stochastically stable points that maximize $g(k) =
\alpha_1 W(k) - \beta_1 S(k)$ for all $k \in \mathcal{A}$, where $W(k) = \sum_{i=1}^{N} u_i(k)$ and

\[
S(k) = \min\{\delta : \text{the benchmark actions constitute a } \delta\text{-equilibrium}\}.
\tag{34}
\]

We refer to [22] for further details of the function $g$.

We also plot the sum rate at the CCE and at the Nash bargaining solution obtained for Example 1 in
Figure 1. We observe that the sum rate at a CCE is better than that at a CE, but that the MW algorithm
runs for about 250,000 iterations, which is more than the number of iterations required for computing a
CE. The sum rate at the Nash bargaining solution is very close to that at the Pareto point, but the former
provides fairness among users. We present the rates at the Pareto point and at the Nash bargaining solution
in Table I to illustrate the fairness provided by the Nash bargaining solution for Example 1. The rates of
all the three users are given as a triplet $(r_1, r_2, r_3)$, where $r_i$ is the rate of user $i$. It can be seen for
several SNR values that the Nash bargaining solution provides more fairness than the Pareto point. Even
though Example 1 is symmetric, as the algorithms are based on stochastic local search, rate allocations
need not be symmetric.


SNR (dB)    Rates at Pareto point    Rates at Nash bargaining
5           (0.25, 0.25, 0.25)       (0.25, 0.25, 0.25)
7           (0.67, 0.37, 0.43)       (0.48, 0.45, 0.46)
10          (0.58, 0.77, 0.19)       (0.49, 0.49, 0.47)
12          (0.36, 0.34, 0.89)       (0.50, 0.51, 0.5)
15          (0.55, 0.41, 0.69)       (0.54, 0.52, 0.55)

TABLE I. Fairness of rates at the Pareto point and the Nash bargaining solution for Example 1.
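
The trade-off visible in Table I can be summarized with a short Python calculation. The rate triplets are copied from the table; the min-to-max ratio used as a spread measure is an illustrative choice made here, not a fairness metric used in the paper.

```python
# SNR (dB) -> (rates at Pareto point, rates at Nash bargaining solution), from Table I.
table = {
    5: ((0.25, 0.25, 0.25), (0.25, 0.25, 0.25)),
    7: ((0.67, 0.37, 0.43), (0.48, 0.45, 0.46)),
    10: ((0.58, 0.77, 0.19), (0.49, 0.49, 0.47)),
    12: ((0.36, 0.34, 0.89), (0.50, 0.51, 0.5)),
    15: ((0.55, 0.41, 0.69), (0.54, 0.52, 0.55)),
}


def spread(rates):
    """Ratio of smallest to largest rate; 1.0 means all users get equal rates."""
    return min(rates) / max(rates)


summary = {
    snr: {
        "sum_pareto": round(sum(pareto), 2),
        "sum_nash": round(sum(nash), 2),
        "spread_pareto": round(spread(pareto), 2),
        "spread_nash": round(spread(nash), 2),
    }
    for snr, (pareto, nash) in table.items()
}

for snr, row in summary.items():
    print(snr, row)
```

For every SNR in the table, the Nash bargaining triplet has a min-to-max ratio at least as large as the Pareto triplet, while its sum rate is only slightly lower (for example, 1.39 against 1.47 at 7 dB).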


We note that the sum rate of all the users is higher at the Pareto optimal point than at a CE. The
improvement is 21.8% at the average transmit SNR constraint of 10 dB, and 24.6% at the SNR of 15 dB.

Fig. 1. Sum rates at Pareto points and CE for Example 1.

In the computation of CE, for each pure strategy $k^{(i)} \in \mathcal{A}^{(i)}$ we need to estimate the regret, which requires
a threshold to be calculated in advance, before the algorithm starts running. This requires two
multiplications and two additions per action for each user. But, for one iteration of the MW algorithm,
each user requires one division per action and two multiplications. Hence, even though the regret-matching
algorithm requires computation of the estimated regret, it may converge faster than the MW algorithm, which
does not require computation of regret. It is observed from the examples that regret-matching has a relatively
shorter running time than the MW algorithm.

Next we consider an asymmetric scenario, in Example 2. In this example also we consider $\mathcal{H}_d^{(i)} =
\{0.2, 0.6, 1\}$ and $\mathcal{H}_c^{(i)} = \{0.1, 0.3, 0.5\}$ for each user $i$. The direct link gains from $\mathcal{H}_d^{(i)}$ occur with equal
probability for each user $i$, but the cross link gains occur with a different probability distribution for each
user. For user 1, the distribution is {0.5, 0.3, 0.2}, for user 2 it is {0.4, 0.5, 0.1}, and for user 3, it is
{0.25, 0.5, 0.25}. Users 1, 2, and 3 transmit at rates 0.5, 0.75, and 0.9 bits per channel use respectively. The
sum rate at the CE and at the Pareto point obtained from our algorithm are compared in Figure 2. We also
compare the sum rate at the CE obtained by using the reinforcement learning (RL) algorithm in [20]. In this example also, we observe
that our algorithm converges faster than the RL algorithm: at SNR of 15 dB, our algorithm converges in about
250,000 iterations, whereas the algorithm in [20] converges in about 770,000 iterations. We also plot the
sum rate at a stochastically stable point of the algorithm in [22], which maximizes $g(k) = \alpha_1 W(k) - \beta_1 S(k)$.

Fig. 2. Sum rates at Pareto points and CE for Example 2.

We also plot the sum rate at a CCE and at a Nash bargaining solution for Example 2 in Figure 2.
In this example also, we observe an improvement in the sum rate at a CCE over that at a CE, and the
MW algorithm runs for about 325,000 iterations, which is more than the number of iterations required for
computing a CE. The sum rate at the Nash bargaining solution is very close to that at the Pareto point. We
observe that the sum of the rates of all the users is higher at the Pareto optimal point than at a CE. We
observe an improvement of 22.7% at SNR of 10 dB and an improvement of 17.5% at SNR of 15 dB. The sum
rate obtained via [22] at its stochastically stable point is the lowest.

Finally, we consider multiple rates of transmission in Example 3. For Example 3, we consider the same
parameters as in Example 2, but each user can send data at any rate from the set $\mathcal{R} = \{0.75, 0.9, 1.2\}$. We
compare the sum rates at a CE, CCE, Pareto point, and Nash bargaining solution in Figure 3. We also plot
the sum rates using the reinforcement learning (RL) algorithm and the TE algorithm for this example in Figure 3. In this example also
we observe an improvement in the sum rate at the CCE over the sum rate at the CE. The sum rates at the Nash
bargaining solution and at the Pareto point almost coincide in this example also. As the cardinality of the
strategy set of each user is enlarged by transmitting at multiple rates, the regret-matching algorithm runs
for about 400,000 iterations to compute a CE and the MW algorithm runs for about 475,000 iterations to
compute a CCE, at SNR of 15 dB.

Fig. 3. Sum rates at Pareto points and CE for Example 3.


IX. Conclusions 

We have considered a communication system in which N transmitter-receiver pairs communicate on a
wireless channel. Each transmitter sends data at a certain rate at a power level that is a function of the direct
link channel gain and the feedback received from its receiver, to maximize the probability of successful
transmission. This scenario is modeled as a stochastic game and fully distributed learning algorithms are
proposed to find a correlated equilibrium (CE) and a Pareto point. We have compared the sum of rates
of all the users at the CE and the Pareto point, and we observe that the Pareto optimal power allocations
provide higher probability of successful transmission. We have also compared our algorithms with two other
recent learning algorithms in the literature [20], [22]. The CE obtained by our algorithm performs as well as
the CE obtained via the algorithm in [20], but our algorithm converges much faster. On the other hand, the
performance of our CE is better than the best point obtained by the algorithm in [22].

We also note in our examples that we can achieve a higher sum rate by operating at a CCE than by operating
at a CE, but at the expense of more iterations to compute it. But, in general, it is not guaranteed
that a CCE yields a better sum rate than a CE. On the other hand, a Nash bargaining solution may be a
better operating point than an arbitrary Pareto point as it provides fairness among the users.

Transmitting at multiple rates can significantly improve the sum rate, but as the strategy set of each player
is enlarged, more iterations are required to converge to either a Nash bargaining solution or a
CCE. In practice, which algorithm to use depends on the nature of the problem: if, for example, the
cardinality of the overall action space is not large, then the users can find a Pareto point more quickly than
a CE or CCE. But if the action space is large and quick convergence to an equilibrium is required, one
can use regret-matching to converge to a CE.